Intro Machine Learning

Crash Course - JEST Internal Workshop

Tiago Tamagusko

02 Sep 2022

About me

  • Transportation Specialist
  • Self-taught Data Scientist
  • PhD Candidate @ UC
  • https://tamagusko.com
  • Research areas:
    • Machine Learning
    • Computer Vision

Workshop outline

  • Intro (5 minutes)
  • Machine learning basics (25 minutes)
  • A data science project (5 minutes)
  • Hands-on: Traditional vs Machine learning (10 minutes)
  • Break: 5 minutes
  • Practical example (40 minutes)

Total: 1H30

Machine Learning

Can machines think?


“Machine Learning algorithms enable computers to learn from data, and even improve themselves, without being explicitly programmed” (Arthur Samuel).

Traditional vs ML

def function(*args):
    if args[0] > c0:
        if args[1] > c1:
            if args[2] > c2:
                ...
            else:
                pass
        else:
            pass
    else:
        if args[1] > c1:
            if args[2] > c2:
                ...
            else:
                pass
        else:
            pass
            
# Note: pass = do something
import pandas as pd
from sklearn.BRANCH import MODEL_NAME
from sklearn.metrics import METRIC_NAME
from sklearn.model_selection import train_test_split

df = pd.read_csv('data.csv')  # load data

# Split data into train/test (70/30)
X, y = df.drop(['TARGET'], axis=1), df['TARGET']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

MODEL = MODEL_NAME()  # build model
MODEL.fit(X_train, y_train)  # train model
y_pred_MODEL = MODEL.predict(X_test)  # result

METRIC = METRIC_NAME(y_test, y_pred_MODEL)  # eval
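As a concrete (hypothetical) instantiation of the template above, assuming a classification task: scikit-learn's bundled iris dataset stands in for 'data.csv', a DecisionTreeClassifier fills the MODEL_NAME slot, and accuracy_score fills METRIC_NAME.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Bundled iris dataset stands in for 'data.csv'
X, y = load_iris(return_X_y=True, as_frame=True)

# Split data into train/test (70/30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

model = DecisionTreeClassifier(random_state=42)  # build model
model.fit(X_train, y_train)                      # train model
y_pred = model.predict(X_test)                   # result

accuracy = accuracy_score(y_test, y_pred)        # eval
print(f"accuracy: {accuracy:.2f}")
```

The same five steps (load, split, build, train, evaluate) apply regardless of which model or metric is plugged in.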

Methods

by Abdul Rahid

Methods

Choosing model

scikit-learn: Choosing the right estimator

Overfitting & Underfitting

Variance & Bias

by Satya Mallick, 2021

  • High bias: underfitting
  • High variance: overfitting
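A minimal sketch of both failure modes, on synthetic noisy data (illustrative, not from the workshop repo): a very shallow decision tree underfits (high bias, low scores everywhere), while an unlimited-depth tree overfits (high variance, perfect on train, noticeably worse on test).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data with 20% label noise
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 5, None):  # shallow = high bias, unlimited = high variance
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.2f} "
          f"test={tree.score(X_test, y_test):.2f}")
```

The gap between train and test score is the symptom to watch: a large gap signals overfitting, uniformly low scores signal underfitting.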

Algorithms

Decision Tree

By synergy37AI, in Medium: Decision Trees: Lesson 101
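A decision tree is essentially a learned cascade of if/else thresholds, much like the hand-written function earlier. A small sketch (using the iris dataset for illustration) that prints the rules a tree actually learned:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned if/else thresholds in plain text
print(export_text(tree, feature_names=load_iris().feature_names))
```

The difference from traditional programming is that here the thresholds are chosen by the algorithm from the data, not by the programmer.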

Random Forest

By Tony Yiu, in Medium: Understanding Random Forest

K Nearest Neighbors (KNN)

Note: Small K can generate overfitting (learns from noise), large K can generate underfitting (neighbors too far away).
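The effect of K can be sketched on noisy synthetic data (illustrative values, not from the workshop): with K=1 the classifier memorizes the training set, while a very large K averages over almost everyone and washes the pattern out.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy two-moons data: small K chases the noise, large K over-smooths
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 15, 199):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k}: train={knn.score(X_train, y_train):.2f} "
          f"test={knn.score(X_test, y_test):.2f}")
```

K=1 scores 100% on the training set (each point is its own nearest neighbor) yet worse on test data; a moderate K usually generalizes best.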

Data Science Project

Hands-on

Data

import pandas as pd

df = pd.read_csv('data/jobs.csv')
# Note: True = 1, False = 0

print(df)
             jobs  salary    places  coffee  accept
0    data-analyst    2500    remote   False   False
1  data-scientist    3000    remote   False   False
2  data-scientist    3500    remote   False    True
3  data-scientist    5000  New York    True    True
4   data-engineer    4000    Madrid    True    True
5          devops    3500    Lisbon   False   False
6       qa-tester    3800     Porto    True   False

Problem

Accept a job offer based on location (remote or face-to-face), salary, and coffee!

Conditions:

  • $3500+ with remote work.
  • $4000+ with free coffee!
  • $4500+ and I buy my Nespresso.

Traditional programming

def accept_offer(places: str, salary: int, coffee: bool) -> bool:
    if places == "remote" and salary >= 3500:
        return True
    elif salary >= 4000 and coffee:
        return True
    elif salary >= 4500:
        return True
    else:
        return False

offer1 = accept_offer("New York", 4400, False)
offer2 = accept_offer("Berlin", 5000, False)
offer3 = accept_offer("remote", 3600, False)

print(f"[{offer1} {offer2} {offer3}]")
[False True True]

Machine Learning

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('data/jobs.csv') # load data

# Preprocessing data
df['places'] = df['places'].replace({'remote': 1})
df['places'] = df['places'].replace(r'\D+', '0', regex=True).astype(int)

X, y = df.drop(['jobs', 'accept'], axis=1), df.accept

MODEL = RandomForestClassifier(random_state=1)  # build model
MODEL.fit(X, y)  # train model

offers = {'salary': [4400, 5000, 3600],
          'places': [0, 0, 1], 
          'coffee': [0, 0, 0]}
offer = pd.DataFrame(offers)

print(MODEL.predict(offer))
[False  True  True]
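To peek inside the trained forest, its feature_importances_ attribute shows which columns the learned rules rely on most. A self-contained sketch that rebuilds the same table inline (so it runs without data/jobs.csv, with 'places' already encoded as in the preprocessing step):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Same table as data/jobs.csv, rebuilt inline (remote = 1, other places = 0)
df = pd.DataFrame({
    'salary': [2500, 3000, 3500, 5000, 4000, 3500, 3800],
    'places': [1, 1, 1, 0, 0, 0, 0],
    'coffee': [0, 0, 0, 1, 1, 0, 1],
    'accept': [False, False, True, True, True, False, False],
})

X, y = df.drop('accept', axis=1), df['accept']
model = RandomForestClassifier(random_state=1).fit(X, y)

# Which columns drive the accept/reject decision
for name, importance in zip(X.columns, model.feature_importances_):
    print(f"{name}: {importance:.2f}")
```

The importances sum to 1 and give a rough ranking of the inputs; with only seven rows this is a toy illustration, not a reliable analysis.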

Recap

  • Traditional programming:
    • Explicitly programmed rules
  • Machine learning:
    • Learning from data (input + label) - Supervised learning
    • Label/cluster data (input) - Unsupervised learning
    • Learn from mistakes and successes (environment) - Reinforcement learning
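The workshop example is supervised learning; to give the unsupervised case from the recap a face too, a minimal K-Means sketch on unlabelled synthetic data (not part of the workshop code):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabelled points drawn from 3 groups; KMeans must find them itself
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])      # cluster assigned to each point
print(kmeans.cluster_centers_)  # learned cluster centres
```

Note that no labels are passed to fit(): the algorithm groups the points purely from their positions.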

Break

5 minutes

Practical example

Thanks!


@tamagusko on LinkedIn to stay in touch!


Course repository
